.

Chatting with our data

** Lets meet a very nice guy: dplyr **

  1. We are going to need take some variable

  2. and make a new one from it

  3. Perhaps, there are going to be sometimes when we would like to know the possible cases that would come from the combination of two categorical variables and calculate the frequency of each scenario (distribution) and evenresume the information than’d come from it

  4. Another times, we’d just need to order our dataset from some specific variable or variables in order to watch the biggest (or the smallest) ones at above (or at the bottom)

  5. And finally what we’d want to tell our data is that we only want to see those cases (rows) than would fulfill some condition

So, translating …

Quoting from - dplyr’s github:

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  1. select() picks variables based on their names.

  2. mutate() adds new variables that are functions of existing variables.

  3. summarise() reduces multiple values down to a single summary.

  4. arrange() changes the ordering of the rows.

  5. filter() picks cases based on their values.

# library("nycflights13")
# saveRDS(flights,"data/flights.rds")
# saveRDS(airlines,"data/airlines.rds")
library(tidyverse)
── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
# library("nycflights13")
# saveRDS(flights,"data/flights.rds")
# saveRDS(airlines,"data/airlines.rds")
library(tidyverse)
mpg

select

# Drop unimportant variables so it's easier to understand the join results.
flights %>% select(year:day, hour, origin, dest, tailnum, carrier)

mutate

flights %>% summarise(I_tarde=if_else(arr_delay>0,1,0))
Error: Column `I_tarde` must be length 1 (a summary value), not 336776

arrange

filter

sum(is.na(flights2$I_tarde)); print(paste0("The 'NA's represent the: ", round(sum(is.na(flights2$I_tarde))/nrow(flights2),4)*100,"% of the complete Universe"))
[1] 9430
[1] "The 'NA's represent the: 2.8% of the complete Universe"

Option 1: Take them out of here

Option 2: Special Value

flights2_opt2; sum(is.na(flights2_opt2$I_tarde)); print(paste0("The 'NA's represent the: ", round(sum(is.na(flights2_opt2$I_tarde))/nrow(flights2_opt2),4)*100,"% of the Universe")); sum(flights2_opt2$I_tarde==-9999); print(paste0("The Special Values represent the: ", round(sum(flights2_opt2$I_tarde==-9999)/nrow(flights2_opt2),4)*100,"% of the Universe"))
[1] 0
[1] "The 'NA's represent the: 0% of the Universe"
[1] 9430
[1] "The Special Values represent the: 2.8% of the Universe"

Option N & beyond: More sophisticated stuff

print(important_mens)
[1] " We won't see any of these :D   "

Grouped operations

group_by

  • Counting number of flights per day

```

References

Cheatsheet

LS0tCnRpdGxlOiAiIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazoKICAgIGRmX3ByaW50OiBwYWdlZAogIGh0bWxfZG9jdW1lbnQ6CiAgICBkZl9wcmludDogcGFnZWQKLS0tCgojIDxhPjxpbWcgc3JjPSd3d3cvYmJ2YV91YWR5MDIucG5nJyBhbGlnbj0icmlnaHQiIGhlaWdodD0iMTM5IiAvPjwvYT4KICAgLiA8YnIvPgoKI0NoYXR0aW5nIHdpdGggb3VyIGRhdGEKCioqIExldHMgbWVldCBhIHZlcnkgbmljZSBndXk6ICpgZHBseXJgKiAqKgoKLSBNYXliZSwgdGhlIG1vc3Qgb2Ygb3VyIHRpbWUsIHdpbGwgYmUgc3BlbnQgb24gYHRhbGtpbmcgdG8gb3VyIGRhdGFgCgotIE9uZSBvZiB0aGUgbW9zdCBpbXBvcnRhbnQgdHlwZSBvZiB3b3JkcyBvbiBvdXIgbGFuZ3VhZ2VzIGFyZSB0aGUgYHZlcmJzYAoKLSBUaGlzIG1ha2Ugc2Vuc2UgYmVjYXVzZSB0aGV5IGBleGVjdXRlIGFjdGlvbnNgLCBhbmQgdGhhdCBpcyB3aGF0IHdlJ2xsIHdhbnQKCjEuIFdlIGFyZSBnb2luZyB0byBuZWVkIGB0YWtlYCBzb21lICp2YXJpYWJsZSogCgoyLiBhbmQgYG1ha2VgIGEgbmV3IG9uZSBmcm9tIGl0CgozLiBQZXJoYXBzLCB0aGVyZSBhcmUgZ29pbmcgdG8gYmUgc29tZXRpbWVzIHdoZW4gd2Ugd291bGQgbGlrZSB0byBrbm93IHRoZSBwb3NzaWJsZSBjYXNlcyB0aGF0IHdvdWxkIGNvbWUgZnJvbSB0aGUgY29tYmluYXRpb24gb2YgdHdvIGNhdGVnb3JpY2FsIHZhcmlhYmxlcyBhbmQgY2FsY3VsYXRlIHRoZSBmcmVxdWVuY3kgb2YgZWFjaCBzY2VuYXJpbyAoYGRpc3RyaWJ1dGlvbmApIGFuZCBldmVuYCByZXN1bWVgIHRoZSBpbmZvcm1hdGlvbiB0aGFuJ2QgY29tZSBmcm9tIGl0Cgo0LiAgQW5vdGhlciB0aW1lcywgd2UnZCBqdXN0IG5lZWQgdG8gYG9yZGVyYCBvdXIgZGF0YXNldCBmcm9tIHNvbWUgc3BlY2lmaWMgdmFyaWFibGUgb3IgdmFyaWFibGVzIGluIG9yZGVyIHRvIHdhdGNoIHRoZSBgYmlnZ2VzdGAgKG9yIHRoZSBgc21hbGxlc3RgKSBvbmVzIGF0IGFib3ZlIChvciBhdCB0aGUgYm90dG9tKQoKNS4gQW5kIGZpbmFsbHkgd2hhdCB3ZSdkIHdhbnQgdG8gKnRlbGwqIG91ciBkYXRhIGlzIHRoYXQgd2Ugb25seSB3YW50IHRvIHNlZSB0aG9zZSBjYXNlcyAoKnJvd3MqKSB0aGFuIHdvdWxkIGZ1bGZpbGwgc29tZSBgY29uZGl0aW9uYAoKCiMjIFNvLCB0cmFuc2xhdGluZyAuLi4KCioqUXVvdGluZyBmcm9tIC0gWypkcGx5cidzIGdpdGh1YipdKGh0dHBzOi8vZ2l0aHViLmNvbS90aWR5dmVyc2UvZHBseXIvYmxvYi9tYXN0ZXIvUkVBRE1FLm1kKToqKgoKZHBseXIgaXMgYSBncmFtbWFyIG9mIGRhdGEgbWFuaXB1bGF0aW9uLCBwcm92aWRpbmcgYSBjb25zaXN0ZW50IHNldCBvZgp2ZXJicyB0aGF0IGhlbHAgeW91IHNvbHZlIHRoZSBtb3N0IGNvbW1vbiBkYXRhIG1hbmlwdWxhdGlvbiBjaGFsbGVuZ2VzOgoKICAKMS4gYHNlbGVjdCgpYCBwaWNrcyB2YXJpYWJsZXMgYmFzZWQgb24gdGhlaXIgbmFtZXMuCgoyLiBgbXV0YXRlKClgIGFkZHMgbmV3IHZhcmlhYmxlcyB0aGF0IGFyZSBmdW5jdGlvbnMgb2YgZXhpc3RpbmcKdmFyaWFibGVzLgoKMy4gYHN1bW1hcmlzZSgpYCByZWR1Y2VzIG11bHRpcGxlIHZhbHVlcyBkb3duIHRvIGEgc2luZ2xlIHN1bW1hcnkuCgo0LiBgYXJyYW5nZSgpYCBjaGFuZ2VzIHRoZSBvcmRlcmluZyBvZiB0aGUgcm93cy4KCjUuIGBmaWx0ZXIoKWAgcGlja3MgY2FzZXMgYmFzZWQgb24gdGhlaXIgdmFsdWVzLgoKCgoKCgpgYGB7cn0Kc291cmNlKCJzY3JpcHRfbW5zLlIiKQojIGxpYnJhcnkoIm55Y2ZsaWdodHMxMyIpCiMgc2F2ZVJEUyhmbGlnaHRzLCJkYXRhL2ZsaWdodHMucmRzIikKIyBzYXZlUkRTKGFpcmxpbmVzLCJkYXRhL2FpcmxpbmVzLnJkcyIpCiMgbGlicmFyeSh0aWR5dmVyc2UpCiMgbXBnCgpgYGAKCmBgYHtyfQpmbGlnaHRzPC1yZWFkUkRTKCJkYXRhL2ZsaWdodHMucmRzIikKYWlybGluZXM8LXJlYWRSRFMoImRhdGEvYWlybGluZXMucmRzIikKZmxpZ2h0cwpgYGAKCiMjIyAqKmBzZWxlY3RgKioKYGBge3J9CiMgRHJvcCB1bmltcG9ydGFudCB2YXJpYWJsZXMgc28gaXQncyBlYXNpZXIgdG8gdW5kZXJzdGFuZCB0aGUgam9pbiByZXN1bHRzLgpmbGlnaHRzICU+JSBzZWxlY3QoeWVhcjpkYXksIGhvdXIsIG9yaWdpbiwgZGVzdCwgdGFpbG51bSwgY2FycmllcikKYGBgCgoKIyMjICoqYG11dGF0ZWAqKgpgYGB7cn0KZmxpZ2h0cyAlPiUgbXV0YXRlKElfdGFyZGU9aWZfZWxzZShhcnJfZGVsYXk+MCwxLDApKQpgYGAKCmBgYHtyfQpmbGlnaHRzJT4lc2VsZWN0KHNjaGVkX2Fycl90aW1lLGFycl90aW1lLGFycl9kZWxheSklPiVtdXRhdGUoSV90YXJkZT1pZl9lbHNlKGFycl9kZWxheT4wLDEsMCkpCmBgYAoKIyMjICoqYGFycmFuZ2VgKioKYGBge3J9CmZsaWdodHMgJT4lIGFycmFuZ2UoeWVhcikKYGBgCmBgYHtyfQpmbGlnaHRzICU+JSBhcnJhbmdlKHllYXIsbW9udGgsZGF5KQpgYGAKYGBge3J9CmZsaWdodHMlPiUKICBzZWxlY3QoeWVhcixtb250aCxzY2hlZF9kZXBfdGltZSxzY2hlZF9hcnJfdGltZSxhcnJfdGltZSxhcnJfZGVsYXkpJT4lCiAgbXV0YXRlKElfdGFyZGU9aWZfZWxzZShhcnJfZGVsYXk+MCwxLDApKSU+JQogIGFycmFuZ2UoZGVzYyh5ZWFyKSxkZXNjKG1vbnRoKSxzY2hlZF9kZXBfdGltZSxkZXNjKElfdGFyZGUpKQpgYGAKCiMjIyAqKmBmaWx0ZXJgKioKYGBge3J9CmZsaWdodHMlPiUKICBmaWx0ZXIoc2NoZWRfZGVwX3RpbWU8MTIwMCkKYGBgCgpgYGB7cn0KZmxpZ2h0cyU+JQogIHNlbGVjdCh5ZWFyLG1vbnRoLHNjaGVkX2RlcF90aW1lLHNjaGVkX2Fycl90aW1lLGFycl90aW1lLGFycl9kZWxheSklPiUKICBtdXRhdGUoSV90YXJkZT1pZl9lbHNlKGFycl9kZWxheT4wLDEsMCkpJT4lCiAgZmlsdGVyKHNjaGVkX2RlcF90aW1lPj0wMDAxICYgc2NoZWRfZGVwX3RpbWU8NjAwICYgSV90YXJkZT09MSkgCmBgYAoKYGBge3J9CmZsaWdodHMyPC1mbGlnaHRzJT4lCiAgbXV0YXRlKElfdGFyZGU9aWZfZWxzZShhcnJfZGVsYXk+MCwxLDApKQpzdW0oaXMubmEoZmxpZ2h0czIkSV90YXJkZSkpOyBwcmludChwYXN0ZTAoIlRoZSAnTkEncyByZXByZXNlbnQgdGhlOiAiLCByb3VuZChzdW0oaXMubmEoZmxpZ2h0czIkSV90YXJkZSkpL25yb3coZmxpZ2h0czIpLDQpKjEwMCwiJSBvZiB0aGUgVW5pdmVyc2UiKSkKYGBgCgpgYGB7cn0KZmxpZ2h0czIlPiUKICBmaWx0ZXIoaXMubmEoSV90YXJkZSkpCmBgYAoKKk9wdGlvbiAxOiBUYWtlIHRoZW0gb3V0IG9mIGhlcmUqCmBgYHtyfQpmbGlnaHRzMl9vcHQxPC1mbGlnaHRzMiU+JQogIGZpbHRlcighaXMubmEoSV90YXJkZSkpCgpmbGlnaHRzMl9vcHQxCmBgYAoKKk9wdGlvbiAyOiBTcGVjaWFsIFZhbHVlKgpgYGB7cn0KZmxpZ2h0czJfb3B0MjwtZmxpZ2h0czIlPiUKICBtdXRhdGUoSV90YXJkZT1pZl9lbHNlKGlzLm5hKElfdGFyZGUpLC05OTk5LElfdGFyZGUpKQoKZmxpZ2h0czJfb3B0Mjsgc3VtKGlzLm5hKGZsaWdodHMyX29wdDIkSV90YXJkZSkpOyBwcmludChwYXN0ZTAoIlRoZSAnTkEncyByZXByZXNlbnQgdGhlOiAiLCByb3VuZChzdW0oaXMubmEoZmxpZ2h0czJfb3B0MiRJX3RhcmRlKSkvbnJvdyhmbGlnaHRzMl9vcHQyKSw0KSoxMDAsIiUgb2YgdGhlIFVuaXZlcnNlIikpOyBzdW0oZmxpZ2h0czJfb3B0MiRJX3RhcmRlPT0tOTk5OSk7IHByaW50KHBhc3RlMCgiVGhlIFNwZWNpYWwgVmFsdWVzIHJlcHJlc2VudCB0aGU6ICIsIHJvdW5kKHN1bShmbGlnaHRzMl9vcHQyJElfdGFyZGU9PS05OTk5KS9ucm93KGZsaWdodHMyX29wdDIpLDQpKjEwMCwiJSBvZiB0aGUgVW5pdmVyc2UiKSkKYGBgCgoqT3B0aW9uIE4gJiBiZXlvbmQ6IE1vcmUgc29waGlzdGljYXRlZCBzdHVmZioKYGBge3J9CnByaW50KGltcG9ydGFudF9tZW5zKQpgYGAKCiMjIEdyb3VwZWQgb3BlcmF0aW9ucwoKIyMjICoqYGdyb3VwX2J5YCoqCgotIENvdW50aW5nIG51bWJlciBvZiBmbGlnaHRzIHBlciBkYXkKCmBgYHtyfQpmbGlnaHRzMl9vcHQyJT4lCiAgZ3JvdXBfYnkoeWVhcixtb250aCxkYXkpJT4lCiAgbXV0YXRlKG5fZmxpZ2h0cz1uKCkpCmBgYAoKCmBgYHtyfQpmbGlnaHRzMl9vcHQyJT4lCiAgZ3JvdXBfYnkoeWVhcixtb250aCxkYXkpJT4lCiAgbXV0YXRlKG5fZmxpZ2h0cz1uKCkjLAogICAgICAgICAjIGlkcm93PXJvd19udW1iZXIoKQogICAgICAgICApIyU+JQogICMgYXJyYW5nZSh5ZWFyLG1vbnRoLGRheSklPiUKICAjIHNlbGVjdCh5ZWFyLG1vbnRoLGRheSxuX2ZsaWdodHMsaWRyb3cpCiAgCmBgYApgYGAKCgoKCgoKCiNSZWZlcmVuY2VzCgotIFtHbyB0byBoaW1dKGh0dHBzOi8vZ2l0aHViLmNvbS90aWR5dmVyc2UvZHBseXIvYmxvYi9tYXN0ZXIvUkVBRE1FLm1kKQoKLSBbVGhlIGJlc3QgcGxhY2UgdG8gc3RhcnRdKGh0dHBzOi8vcjRkcy5oYWQuY28ubnovdHJhbnNmb3JtLmh0bWwpCgojIyBDaGVhdHNoZWV0Cgo8YSBocmVmPSJodHRwczovL2dpdGh1Yi5jb20vcnN0dWRpby9jaGVhdHNoZWV0cy9ibG9iL21hc3Rlci9kYXRhLXRyYW5zZm9ybWF0aW9uLnBkZiI+PGltZyBzcmM9Imh0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9yc3R1ZGlvL2NoZWF0c2hlZXRzL21hc3Rlci9wbmdzL3RodW1ibmFpbHMvZGF0YS10cmFuc2Zvcm1hdGlvbi1jaGVhdHNoZWV0LXRodW1icy5wbmciIHdpZHRoPSI2MzAiIGhlaWdodD0iMjUyIi8+PC9hPiAgCgojIyBUd28tdGFibGUKW0xldCdzIGdvIGZvciBzb21lIGpvaW5zXShodHRwczovL2RwbHlyLnRpZHl2ZXJzZS5vcmcvYXJ0aWNsZXMvdHdvLXRhYmxlLmh0bWwpCgoK